MAS 635 - Machine Learning Methods
University of Miami
2026-02-01
Goal: Predict HDL cholesterol levels
Dataset: NHANES (National Health and Nutrition Examination Survey)
Approach: Machine Learning & Deep Learning
Key Stats:
High-Density Lipoprotein (HDL) - “Good Cholesterol”
National Health and Nutrition Examination Survey (NHANES)
LBDHDD_outcome - Direct HDL cholesterol (mg/dL)
Normal Ranges:
Key Observations:
Top Correlations with HDL:
Important
Strong Negative Correlation: Higher BMI → Lower HDL
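To make the sign of that relationship concrete, here is a minimal sketch of a Pearson correlation in plain Python. The `bmi` and `hdl` values are made up for illustration and are not NHANES records; `pearson_r` is a hypothetical helper name, not something from the project code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations (numerator) and root sums of squares (denominator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Illustrative values only (assumed, not taken from the dataset).
bmi = [20.0, 24.0, 28.0, 32.0, 36.0]
hdl = [65.0, 58.0, 52.0, 47.0, 40.0]

r = pearson_r(bmi, hdl)
print(f"Pearson r between BMI and HDL (toy data): {r:.3f}")
```

A negative `r` means HDL tends to fall as BMI rises. On the real data the same calculation would be run on `BMXBMI` against `LBDHDD_outcome`.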
Key Findings:
Clinical Insight: Abdominal obesity strongly associated with lower HDL
Baseline Models (7):
Deep Learning (2):
Regularization:
Observations:
| Model | RMSE ↓ | MAE ↓ | R² ↑ |
|---|---|---|---|
| Gradient Boosting | 5.0312 | 3.9438 | 0.6963 |
| CatBoost | 5.0388 | 3.9370 | 0.6954 |
| XGBoost | 5.1887 | 4.1054 | 0.6770 |
| Elastic Net | 5.8996 | 4.6086 | 0.5825 |
| Basic NN | 6.5471 | 5.1350 | 0.4858 |
| Advanced NN | 7.3001 | 5.8338 | 0.3607 |
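For reference, the three columns in the table can be computed as in the sketch below. It uses the standard definitions of RMSE, MAE, and R²; the function name `regression_metrics` and the four sample values are assumptions for illustration, not the project's evaluation code or data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy HDL values in mg/dL (assumed, for illustration only).
y_true = [40.0, 50.0, 60.0, 70.0]
y_pred = [42.0, 48.0, 63.0, 69.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE and MAE are in the target's units (mg/dL), so lower is better. R² is unitless, and higher is better.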
Note
Final Approach: Stacking ensemble with Optuna-tuned models and Ridge meta-learner
Approach: Out-of-fold stacking with Ridge meta-learner
Base Models:
Stacking OOF RMSE:
| Model | OOF RMSE |
|---|---|
| XGBoost | 4.7031 |
| CatBoost | 4.7318 |
| GradBoost | 4.8308 |
| Stacked | 4.6434 |
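The out-of-fold stacking idea can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the project's pipeline: the data is synthetic (`make_regression`), and the base learners are scikit-learn stand-ins rather than the Optuna-tuned XGBoost, CatBoost, and Gradient Boosting models. The fold count and `alpha` are also assumed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the NHANES features and HDL target (illustrative only).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Stand-in base learners (assumption: not the tuned models from the deck).
base_models = {
    "gradboost": GradientBoostingRegressor(n_estimators=50, random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One column per base model. Each row is predicted by a model fitted on the
# other folds, so the meta-learner never sees in-sample base predictions.
oof = np.column_stack(
    [cross_val_predict(model, X, y, cv=cv) for model in base_models.values()]
)


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


for name, column in zip(base_models, oof.T):
    print(f"{name:10s} OOF RMSE: {rmse(y, column):.4f}")

# Ridge meta-learner, itself scored out-of-fold on the OOF prediction matrix.
stacked_oof = cross_val_predict(Ridge(alpha=1.0), oof, y, cv=cv)
stacked_rmse = rmse(y, stacked_oof)
print(f"{'stacked':10s} OOF RMSE: {stacked_rmse:.4f}")
```

Scoring the meta-learner with its own cross-validation keeps the stacked RMSE comparable to the base models' OOF scores. For test-set predictions, the base models and the Ridge model would then be refit on all training rows.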
Why Stacking?: The meta-learner learns a weighted combination of the base models' predictions, lowering OOF RMSE from 4.7031 (XGBoost, the best single model) to 4.6434, a 1.27% improvement
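The 1.27% figure follows directly from the OOF RMSE table. A short check, using only the values reported there:

```python
# OOF RMSE values copied from the table above.
oof_rmse = {"XGBoost": 4.7031, "CatBoost": 4.7318, "GradBoost": 4.8308}
stacked_rmse = 4.6434

# Best single model is the one with the lowest OOF RMSE.
best_name = min(oof_rmse, key=oof_rmse.get)
best_rmse = oof_rmse[best_name]

# Relative reduction in RMSE, expressed as a percentage.
improvement_pct = (best_rmse - stacked_rmse) / best_rmse * 100
print(f"Stacking vs {best_name}: {improvement_pct:.2f}% lower OOF RMSE")
```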
Validation: Test predictions align well with training distribution
Modifiable Risk Factors:
Non-Modifiable Factors:
For Healthcare Providers:
For Patients:
Warning
Important Considerations:
Technical:
Clinical:
Contact:
Resources:
Computing Environment:
Training Time:
| Variable | Description | Type |
|---|---|---|
| LBDHDD_outcome | HDL Cholesterol (mg/dL) | Target |
| RIAGENDR | Gender (1=M, 2=F) | Categorical |
| RIDAGEYR | Age (years) | Numeric |
| BMXBMI | Body Mass Index (kg/m²) | Numeric |
| BMXWAIST | Waist Circumference (cm) | Numeric |
Full data dictionary available in repository
MAS 635 - HDL Prediction Project